Searching for the root of good white wine

by Ignacio Ferreras Astorqui

Before starting this project I did some research to get some insight into white wines. The chemicals in a wine make up only about 2% of its composition; the other 98% is water and alcohol. Some studies have shown that no specific chemical is directly related to quality. Knowing this, I am prepared to face the fact that the plots might not show any great differences.

Univariate Plots Section

Before I start plotting, it is important to study each variable on its own in order to get some insight into our data.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality  classification
##  Min.   : 8.00   3:  20   bad   : 183   
##  1st Qu.: 9.50   4: 163   good  :1060   
##  Median :10.40   5:1457   medium:3655   
##  Mean   :10.51   6:2198                 
##  3rd Qu.:11.40   7: 880                 
##  Max.   :14.20   8: 175                 
##                  9:   5

The summary above shows the distribution of the quality of our data in a more specific way. For the purpose of the project we are going to establish a “logical barrier”: good wines are the ones with a quality of 7 or more, medium (normal) wines the ones scoring 5 or 6, and bad wines everything under 5. I hope this helps us infer how the different variables affect the quality.

This grouping was chosen because of the small number of high-grade wines (only 5 wines score a 9, for example). With so few samples, a single outlier can greatly affect the conclusions we draw. If we look for a trend rather than a specific value, we can still see how the values of a variable relate to the quality of the wine. For example, if fixed.acidity tends towards higher values in the good wines while the medium and bad ones concentrate at lower values, it would be reasonable to assume that a higher fixed.acidity is associated with a higher final quality.

Given that our main objective is to assess which variables or characteristics primarily affect the quality of the wine, I think it is important to know the distribution of our data first. A great number of our wines score a 6 and only a small percentage score an 8 or higher. Knowing that these wines are scarce, we need to observe how they appear in the following plots.

There seem to be some gaps between the different levels of alcohol; it is curious not to see the continuous flow that could be expected. Even so, the distribution yields a mean of 10.51, which may seem strange given that the tallest bars fall near 9. The long tail towards higher alcohol levels is what pulls the mean up.

The density values are really close to 1. Water has a density of about 1 g/cm^3 and alcohol has a density below 1, so we would expect wine to have a density slightly below 1. Knowing this, we might ask ourselves how there can be wines with a density over 1. This could be because some wines contain a large amount of dissolved substances that are denser than water. In the next part it could be interesting to compare the alcohol and density features.
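The remark about how few wines reach the top grades can be checked with a little arithmetic on the counts printed in the summary. The sketch below copies those counts by hand (the variable names are illustrative, not taken from the original code), computes the share of each grade, and confirms that the grouped totals match the classification column.

```r
# Quality counts copied by hand from the summary output above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total_wines <- sum(quality_counts)

# Percentage of wines at each grade, rounded to one decimal place.
quality_share <- round(100 * quality_counts / total_wines, 1)

# Totals for the three classes used in this report.
class_counts <- c(
  bad    = sum(quality_counts[c("3", "4")]),
  medium = sum(quality_counts[c("5", "6")]),
  good   = sum(quality_counts[c("7", "8", "9")])
)

print(quality_share)
print(class_counts)
```

With these counts, roughly 45% of the wines score a 6 and fewer than 4% score an 8 or higher, which is why the analysis works with the three broader classes.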

Univariate Analysis

The univariate analysis has helped to find interesting features in our data and to confirm which ones will be used in the study.

What is the structure of your dataset?

This dataset has almost 5 thousand entries (4898), each describing a different white wine. The analysis of each wine is recorded as 11 chemical characteristics or features. These features, as we have seen in the previous plots, range from well-known ones such as sugars to more specific ones like free sulfur dioxide. In addition, there is a variable that describes the quality of the wine. According to the dataset documentation, this score is the median of at least 3 evaluations made by professional wine tasters.

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is the quality, given that it is the only feature whose value should, in theory, depend on the other variables. I suspect that there is no direct relation between any single feature and the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think there is no single feature that can help on its own. However, there may be some interesting findings when we compare multiple features while taking into account the quality of each entry.

Did you create any new variables from existing variables in the dataset?

Yes. I created a variable named classification. This variable is a factor that groups the wines by their quality grade, using the following rule:

* 3-4 grade: bad quality wine
* 5-6 grade: medium quality wine
* 7-9 grade: good (high quality) wine
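The rule above can be sketched in base R with `cut()`. This is a minimal illustration, not necessarily how the original variable was built; the helper name `classify_quality` and the example vector are assumptions made for the sketch.

```r
# Hypothetical helper: map quality grades to the three classes described above.
# Accepts either a numeric vector or a factor of grades.
classify_quality <- function(quality) {
  grade <- as.integer(as.character(quality))
  # Intervals are (-Inf, 4], (4, 6] and (6, Inf).
  cut(grade,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("bad", "medium", "good"))
}

# Example grades covering the range seen in the dataset (3 to 9).
example_quality <- factor(c(3, 4, 5, 6, 7, 8, 9))
example <- data.frame(
  quality        = example_quality,
  classification = classify_quality(example_quality)
)
print(example)
```

Converting through `as.character()` first matters when quality is stored as a factor, because calling `as.integer()` directly on a factor returns the internal level codes rather than the grades.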

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, as we saw earlier, the alcohol distribution had some strange bars, or rather some bars that were missing altogether. In addition, the distribution isn’t symmetrical, which doesn’t seem to happen with any other feature. Regarding the operations, I cleaned the data to get rid of the outliers in order to have a much better understanding of our data. I also changed the type of the quality variable to a factor so we could use it more easily in our plots and summary data. For all the plots I am using a special set of colors, given that the default ggplot2 colors didn’t help to recognize the possible relationships in the data. To fix that I am using the RColorBrewer library, which provides multiple sets of colors. In this case I am using the Dark2 color set.
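The text does not say how the outliers were removed, so the following is only one possible approach: drop the rows above an upper quantile of a feature, then convert quality to a factor. The helper `drop_upper_outliers`, the cut-off and the toy rows are all assumptions made for illustration.

```r
# Hypothetical helper: keep the rows whose value in `column` is at or below
# the chosen upper quantile. The cut-off is an assumption, not the report's.
drop_upper_outliers <- function(data, column, prob = 0.99) {
  limit <- quantile(data[[column]], probs = prob, na.rm = TRUE)
  # which() also discards rows where the value is missing.
  data[which(data[[column]] <= limit), , drop = FALSE]
}

# Made-up illustrative rows, not taken from the wine dataset.
toy <- data.frame(
  residual.sugar = c(1.2, 1.6, 2.0, 5.1, 7.4, 8.0, 9.9, 10.5, 12.0, 65.8),
  quality        = c(6, 5, 7, 6, 5, 6, 6, 5, 4, 6)
)

# A lower cut-off than the default is used here because the toy data is tiny.
trimmed <- drop_upper_outliers(toy, "residual.sugar", prob = 0.9)

# Treat quality as a categorical variable for plots and summaries.
trimmed$quality <- factor(trimmed$quality)
print(summary(trimmed))
```

Trimming by quantile is easy to explain, but it always removes a fixed share of rows, so it is worth checking how many observations are lost before comparing qualities.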

Bivariate Plots Section

It’s time to start plotting our data. In this comparison we are going to see the distribution of each feature in our dataset with every value colored by its quality or classification. Once we have found the variables that show a relationship with quality, we will develop them further so we can learn as much as we can from this dataset.

As we can see, most of the results fall between approximately five and eight of fixed.acidity. The behavior of the data doesn’t seem to change with quality in these two plots. Unfortunately, I don’t notice any direct relationship between the quality and the fixed acidity.

These plots don’t show any trend in quality due to the volatile.acidity. However, we could speculate that the data is somewhat ordered, given some “layers” around the 1000 and 3000 values in X. That can be spotted more easily in the first plot because it uses fewer colors.

The previous plots don’t show any particular relationship between the citric acid and the quality of the wine. There are multiple wines of different qualities with the same values of citric acid. That doesn’t mean there is no relationship between the citric acid and the quality; it means that there is no direct relationship between them. There could still be an indirect relationship.

These plots show a little more promise, given that there seems to be a bigger concentration of high quality wine with a residual sugar near 0. In order to look into it I will take out the outliers so we can see the data more thoroughly. Now that the outliers have been removed, we see a fairly significant build-up of good quality wines with a residual sugar from 2 to 5. In the next part this variable will be investigated further to test its relationship with the quality variable.

Thanks to this plot we can see that there could be a relationship between the chlorides and the quality, given that there seems to be a concentration of good quality wines for chloride concentrations between 0.02 and 0.04. However, after some tweaking of the data and a visual comparison of the two plots, each showing a different factor, we realized that there is no clear correlation between the two variables.

Like the previous plots, this one doesn’t show any special pattern indicating a direct relationship between quality and the free sulfur dioxide quantities.

These plots look for a relationship between the quality and the total sulfur dioxide. As we can see, there doesn’t seem to be any direct relationship between the variables.

These plots show the relationship between the density of white wine and its quality. However, there doesn’t seem to be any correlation between the two. Thanks to some tweaks, I was able to see that the low quality wines are spread across the different values, and the same happens with the medium quality ones (5-6).

As we can see, the different levels of pH don’t seem to have any effect on the quality of the wine. It wasn’t even necessary to separate the different qualities in order to see the lack of correlation, because the values are completely scattered across the plot.

The different values of sulphates across our dataset show no relationship between the sulphates and the quality of the wines tested. This only serves to reinforce the idea I stated at the start of the project: there are no specific chemicals that make a wine good, but rather a combination of all of them.

This last group of plots shows a slight tendency of the high quality wines towards the highest alcohol values. As we can see, the medium quality wines tend to have alcohol levels between 8.5 and 11. However, after some tweaking of the data I realized that the high quality wines are spread across the whole range of values, which means that there is no direct correlation between the alcohol and the quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

During this part of the project I saw some features with a possible relationship with the quality. As we have seen, the alcohol seems to have an important relationship with the quality. Density also showed some promise, so it will be interesting to see how that relationship develops when we compare alcohol, density and quality or classification. We have seen some other possible correlations, but really slim ones. For example, at first glance we could say that residual sugar is correlated with the quality. However, after looking into it in more detail, we see that this possible relationship is no more than a plotting “mistake” in which the good quality points are plotted over the medium quality ones. This tells us that there is no significant correlation between these features and quality.
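One way to guard against this kind of overplotting effect is to compare proportions instead of looking at stacked points. The sketch below splits residual sugar into bands and computes the share of each class within every band; the rows and the band limits are made up purely for illustration and are not taken from the wine dataset.

```r
# Made-up illustrative rows, not taken from the wine dataset.
toy <- data.frame(
  residual.sugar = c(1.1, 1.5, 2.5, 3.0, 4.2, 4.8,
                     6.0, 7.5, 9.0, 12.0, 14.0, 18.0),
  classification = factor(
    c("medium", "good", "medium", "medium", "good", "medium",
      "medium", "bad", "medium", "medium", "good", "medium"),
    levels = c("bad", "medium", "good")
  )
)

# Arbitrary bands chosen for the example.
sugar_band <- cut(toy$residual.sugar, breaks = c(0, 2, 5, 10, Inf))

# Share of each class within every band (each row sums to 1).
share_by_band <- prop.table(table(sugar_band, toy$classification), margin = 1)
print(round(share_by_band, 2))
```

If the share of good wines stays roughly constant across the bands, an apparent cluster in the scatter plot is more likely to come from drawing order than from a real relationship.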

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In this part of the project we focused on the relationships with the main feature of interest. However, thanks to some research about the world of wines and chemistry, I have found out that many of the features are related to each other, such as alcohol and density, or alcohol and total sulfur dioxide, in addition to the more obvious ones like total sulfur dioxide and free sulfur dioxide, or volatile acidity and pH.

What was the strongest relationship you found?

When we focus on the main feature of interest, the strongest relationship found is the one between alcohol and quality, because an apparent relationship between the two can easily be seen in the plot. However, if we talk about the relationships between the other features, the strongest one would be between alcohol and density. This relationship will be examined in the next part of the project.

Multivariate Plots Section

Now that we can see the correlations for every variable in our dataset, I am going to dig into the most promising ones in order to understand their relationships and to visualize them more clearly. In this visualization the progression of our points according to their quality can be clearly spotted. As we can see, the quality increases when the density decreases and the alcohol volume increases. This might be explained by the fact that the density of alcohol is lower than the density of water: to be precise, it is 0.789 g/cm^3, whereas water has a density of 1 g/cm^3.
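The effect of alcohol on density can be illustrated with a rough estimate. The sketch below assumes that wine is only water and ethanol and that their volumes simply add up; both are simplifications, and the function name is made up for this example.

```r
# Densities quoted in the text, in g/cm^3.
ETHANOL_DENSITY <- 0.789
WATER_DENSITY   <- 1.0

# Hypothetical helper: density of an idealized water and ethanol mixture,
# given the alcohol content in percent by volume.
mixture_density <- function(alcohol_percent) {
  fraction <- alcohol_percent / 100
  fraction * ETHANOL_DENSITY + (1 - fraction) * WATER_DENSITY
}

# Minimum, mean and maximum alcohol values from the summary above.
alcohol_levels <- c(8.0, 10.51, 14.2)
print(round(mixture_density(alcohol_levels), 3))
```

For the mean alcohol level of 10.51 this gives about 0.978, and the estimate falls as alcohol rises, which matches the direction seen in the plot. The measured densities (median 0.9937) sit above the estimate, which would be consistent with the dissolved sugars and acids that this simple model ignores.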

Now we are going to look into the relationship between the total sulfur dioxide and the free sulfur dioxide. A relationship is to be expected, given that both variables measure sulfur dioxide. As we can see, this correlation doesn’t give us much insight into how it affects quality, because the different qualities have points scattered across the plot. To check this, we ran the code showing a different layer each time and confirmed that the combination of free and total sulfur dioxide values doesn’t have a direct impact on the quality of the wine.

Sulfur dioxide plays two important roles. Firstly, it is an anti-microbial agent, and as such is used to help curtail the growth of undesirable fault-producing yeasts and bacteria. Secondly, it acts as an antioxidant, safeguarding the wine’s fruit integrity and protecting it against browning. This explains the need for sulfur dioxide in the wine and why great quantities of it can harm the quality. We can even see in the plot that some of the points are “far” from the main group, meaning higher values, and those wines are classified as bad or, in some cases, medium quality.

Again we see some correlation between two different variables, pH and fixed acidity, but it doesn’t seem to really relate to the quality. After some research we found out that pH stands for “Potential of Hydrogen”, a numeric scale used to specify acidity or basicity. Looking at the pH scale, we see that the values we are working with here correspond to clearly acidic substances, which fits with the fixed acidity values given that both provide information about the same concept.

Acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are “flat.” Chemically the acids influence titratable acidity which affects taste and pH which affects color, stability to oxidation, and consequently the overall lifespan of a wine (Waterhouse Lab, UC Davis). Thanks to the information provided by UC Davis we know that acidity has an impact on taste, although none of the acidity features shows a significant effect in our data. This could be because the acidity of a wine is related to where its grapes were harvested. As UC Davis puts it, “wines produced from cool climate grapes are high in acidity and thus taste sour”. The location of the different wines would be a nice addition to the dataset, so we could gain some insight about the wines tested.

Multivariate Analysis

For this part of the project I decided to use the GGally library in order to build a comparison graph of all the different features of the dataset. In addition, the ggpairs function also shows the correlation between the compared variables. Finally, I decided to color the data so that the points are colored by their established quality.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

After a hard look at the different graphs of our data, I found one relationship that has an indirect correlation with the quality: the relationship between the density and the alcohol of the wine. Thanks to the correlations calculated by ggpairs, we can see that the correlation between density and alcohol within each quality class is:

* bad = -0.683
* medium = -0.738
* good = -0.844
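Per-class correlations like the ones listed above can also be computed directly in base R by splitting the data by class. The rows below are made up for illustration, so the resulting numbers are not the ones reported by ggpairs; only the column names follow the dataset.

```r
# Made-up illustrative rows, not taken from the wine dataset.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 10.5,
              9.2, 10.1, 11.0, 12.0,
              11.0, 12.0, 12.5, 13.0),
  density = c(0.9980, 0.9962, 0.9971, 0.9950,
              0.9975, 0.9958, 0.9949, 0.9921,
              0.9930, 0.9915, 0.9908, 0.9900),
  classification = factor(rep(c("bad", "medium", "good"), each = 4),
                          levels = c("bad", "medium", "good"))
)

# Pearson correlation between alcohol and density within each class.
cor_by_class <- sapply(
  split(toy, toy$classification),
  function(group) cor(group$alcohol, group$density)
)
print(round(cor_by_class, 3))
```

A negative value in every class means that, within each group, wines with more alcohol tend to be less dense.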

As we can see, the relationship between them is really strong even though it is inverse. It is also good to know that this relationship becomes stronger as we go towards higher qualities. This has been the strongest relationship we have come across during our study of this dataset. As explained earlier, this is due to the lower density of alcohol. It gives us some explanation of why there seemed to be some kind of relationship between the alcohol and the quality, and one that is easier to spot.

We have also studied other relationships, such as the one between free and total sulfur dioxide and quality, but we haven’t gotten anywhere with it because there didn’t seem to be any real relationship between them and quality; we might qualify it as a false positive. The same can be said for the fixed acidity and the pH. It must be said that for a person well-versed in chemistry these relationships would have been obvious from the start, as pointed out by some colleagues I consulted during my research. However, it is always good to be able to prove a relationship I did not know about, because that pushes me to research and understand the dataset I am working on further.

Were there any interesting or surprising interactions between features?

Here we could talk about how many of the variables in our dataset depend on one another. We have learned that the density is intimately related to the volume of alcohol, as are the pH and the fixed acidity. There are more obvious ones, like the free sulfur dioxide and the total sulfur dioxide.

In addition, we can also relate the residual sugars to the alcohol and the sulfur dioxide, because excessive sulfur dioxide can prevent the yeast from converting all the sugar into alcohol, thus decreasing the alcohol levels and raising the residual sugars. More directly, we could say that the residual sugar and the alcohol are related because alcohol is derived from sugar, and the effect of sulfur dioxide on the yeast explains why the conversion of sugar into alcohol can stop.

What was the strongest relationship you found?

The strongest relationship found is the one between the density-alcohol pair and the quality. Thanks to ggpairs we found a strong downhill linear relationship between density and alcohol. Once we knew about that relationship, it was easier to find out how it related to the quality. It is good to know that we have some relationship that can be linked to the quality of the wine; even though it is not the only one, it is the biggest, with a correlation value around -0.8.

The rest of the variables studied have some kind of relationship with the quality, but not a great one. I continue with the hypothesis that the quality cannot be tied to one specific variable but to the whole group of variables, given that many of them are related to each other.


Final Plots and Summary

Plot One

Description One

This is one of the main univariate plots of the project. It shows the distribution of our dataset according to our main feature of interest. The bars form a practically symmetrical graph in which the largest number of wines fall under quality level 6. For the purpose of the study, the spread is not as wide as I would have wanted, because building models with data in which a great percentage falls under the same quality level is not going to give results as good as we would like. In addition, the results of all the later plots will not be ideal: when comparing qualities we don’t have enough data to be certain of the distributions, because we have to account for possible outliers.

Plot Two

Description Two

This plot is one of the most significant in the bivariate comparison. It may not show a specific correlation between the alcohol and the quality, but it shows great promise due to the apparent grouping of the high quality wines at alcohol volumes over 11. This made me think about how far the relationship between alcohol and quality might go and which other features might be involved. This plot has been colored with the classification variable because with the quality feature there were too many colors and the predominant one (6) swamped all the others. This way we try to bring some balance to the dataset range.

Plot Three

Description Three

This plot displays the possible relationship between alcohol, density and quality. As we can see, there is a relationship between alcohol and density, as we explained earlier. In addition, if you recall, in the first part of the project I talked about a possible relationship between alcohol and quality, but it wasn’t as clear as I would have wanted.

After just one quick glance at the plot we can see a progression in our data. The flow goes from the high density wines with a low level of alcohol, most of which are medium and bad quality ones, to the wines with lower density and higher alcohol levels, which also seem to be the high quality wines. So, as we saw in the ggpairs plot, there is a downhill linear relationship, which means that there is an inverse correlation. This has been the strongest correlation found in our dataset.


Reflection

This project’s objective is to find a reason in our data for a wine to be of good quality. To that end we have developed multiple plots and faced different problems along the way. The structure of the project is incremental in the number of elements involved in the comparisons. This was done in order to understand the data first and then expand on the possible relationships between the features, no matter how intricate those relationships may be.

It has helped me to focus my research on more specific questions about wine, which in turn helped me find out how the chemicals work and see how much effort has been put into answering wine-related questions like this one. Given that my chemistry background wasn’t as wide as I would have wanted, it posed even more challenges: in order to understand the data, I first had to understand what the features under study are and how some of them relate to factors outside our dataset.

Regarding the creation of a model, I didn’t feel confident enough in the relationships between the main feature of study and the rest of the dataset, so I decided not to develop one. Maybe with more evenly spread data I could have developed a good linear model. To conclude, I can say with a fair amount of confidence that there is no specific feature that is directly related to the level of quality of the wine. However, each and every one of the features contributes to the quality: some might be affecting the taste, others the alcohol levels or even the smell.

For future work I would love to see this dataset complemented with extra features, such as where the wine is from or more specific chemical compounds, because there might be sugars with different roles, for example. It would also be great to have the quality feature divided into two or three subgroups based on smell, taste and color, for example. This could help to understand how each feature affects one specific aspect of the wine. The location of the wine could be important to know because the different regions of the world have a different chemical composition in their soil, and this greatly affects the wine produced there. One last feature that could be really interesting to know is the year when the grapes were harvested, because the quality of the wine from one winery can differ greatly from one year to another due to multiple factors. I think that with some of the features I presented here I could obtain a more refined answer to which features of the wine have a bigger impact on its quality.